Leveraging Video Descriptions to Learn Video Question Answering
نویسندگان
چکیده
We propose a scalable approach to learn video-based question answering (QA): to answer a free-form natural language question about the contents of a video. Our approach automatically harvests a large number of videos and descriptions freely available online. Then, a large number of candidate QA pairs are automatically generated from descriptions rather than manually annotated. Next, we use these candidate QA pairs to train a number of video-based QA methods extended from MN (Sukhbaatar et al. 2015), VQA (Antol et al. 2015), SA (Yao et al. 2015), and SS (Venugopalan et al. 2015). In order to handle non-perfect candidate QA pairs, we propose a self-paced learning procedure to iteratively identify them and mitigate their effects in training. Finally, we evaluate performance on manually generated video-based QA pairs. The results show that our self-paced learning procedure is effective, and the extended SS model outperforms various baselines.
منابع مشابه
Uncovering Temporal Context for Video Question and Answering
In this work, we introduce Video Question Answering in temporal domain to infer the past, describe the present and predict the future. We present an encoder-decoder approach using Recurrent Neural Networks to learn temporal structures of videos and introduce a dual-channel ranking loss to answer multiple-choice questions. We explore approaches for finer understanding of video content using ques...
متن کاملQuestion Difficulty Estimation in Community Question Answering Services
In this paper, we address the problem of estimating question difficulty in community question answering services. We propose a competition-based model for estimating question difficulty by leveraging pairwise comparisons between questions and users. Our experimental results show that our model significantly outperforms a PageRank-based approach. Most importantly, our analysis shows that the tex...
متن کاملA Method to Reduce Effects of Packet Loss in Video Streaming Using Multiple Description Coding
Multiple description (MD) coding has evolved as a promising technique for promoting error resiliency of multimedia system in real-time application programs over error-prone communicational channels. Although multiple description lattice vector quantization (MDCLVQ) is an efficient method for transmitting reliable data in the context of potential error channels, this method doesn’t consider disc...
متن کاملVideo Question Answering via Hierarchical Spatio-Temporal Attention Networks
Open-ended video question answering is a challenging problem in visual information retrieval, which automatically generates the natural language answer from the referenced video content according to the question. However, the existing visual question answering works only focus on the static image, which may be ineffectively applied to video question answering due to the lack of modeling the tem...
متن کاملTraffic Scene Analysis using Hierarchical Sparse Topical Coding
Analyzing motion patterns in traffic videos can be exploited directly to generate high-level descriptions of the video contents. Such descriptions may further be employed in different traffic applications such as traffic phase detection and abnormal event detection. One of the most recent and successful unsupervised methods for complex traffic scene analysis is based on topic models. In this pa...
متن کامل